
fix: add read deadline to tls write #3283

Merged

dnwe merged 1 commit into IBM:main from bvalente:tls-deadline on Sep 15, 2025
Conversation

@bvalente
Contributor

@bvalente bvalente commented Sep 8, 2025

Related to:
- golang/go#13828
- IBM#1722
We're using https://github.com/Mongey/terraform-provider-kafka to manage Kafka Topics with Terraform. Recently we've changed from Plaintext communications to AWS IAM Authentication. When doing so, our provider sometimes would hang indefinitely on some plans. We pinned this to the kafka.t3.small cluster tiers, as these have several limitations, including a maximum of 4 TCP connections per second.

While debugging the provider, we saw that the call stack was stuck on a write to the cluster, specifically on the first communication it attempted with the cluster. Reading through the code, we found a very interesting comment on the Write function of the crypto/tls package.

https://github.com/golang/go/blob/go1.23.0/src/crypto/tls/conn.go#L1192-L1195

```
// As Write calls [Conn.Handshake], in order to prevent indefinite blocking a deadline
// must be set for both [Conn.Read] and Write before Write is called when the handshake
// has not yet completed. See [Conn.SetDeadline], [Conn.SetReadDeadline], and
// [Conn.SetWriteDeadline].
```

Based on this, TLS requires both write and read deadlines to be set because the Write function may perform a handshake on the first communication, and the handshake both writes and reads.

I believe that in our case, since we are working with brokers that don't have a very reliable network, sometimes the handshake would not progress on the server side, and we would indefinitely wait for a Read that would never come.

After implementing this change on our local workstation, instead of hanging indefinitely, the program finally reported an error:

```
Error: kafka: client has run out of available brokers to talk to: read tcp 10.xxx.xxx.xxx:59582->10.xxx.xxx.xxx:9098: i/o timeout
```

Collaborator

@puellanivis puellanivis left a comment

Only thing I could think is to join the time.Now() calls into a common local variable, so they’re both based on the same “now”.

But yeah, good change all over otherwise. 👍

Signed-off-by: Bernardo Valente <bernardofvalente@gmail.com>
@bvalente
Contributor Author

bvalente commented Sep 9, 2025

@puellanivis thank you for the review

I addressed your comment and force-pushed after rebasing on master

Collaborator

@puellanivis puellanivis left a comment

Looks great. :)

@bvalente
Contributor Author

Hello @puellanivis, what would be the process to get this merged and tagged? Is there a timeline, or anything I can do from our side? 🙂

@puellanivis
Collaborator

Sometimes reviews from IBM can take a while. I don’t actually have any ability to even approve in my code review, let alone merge anything. I’m just a third-party F/OSS contributor helping out with code reviews.

Collaborator

@dnwe dnwe left a comment

@bvalente thanks! this was a good catch

@dnwe dnwe added the fix label Sep 15, 2025
@dnwe dnwe merged commit 25368c4 into IBM:main Sep 15, 2025
17 checks passed
3AceShowHand pushed a commit to 3AceShowHand/sarama that referenced this pull request Apr 17, 2026
Related to:
- golang/go#13828
- IBM#1722

We're using https://github.com/Mongey/terraform-provider-kafka to manage
Kafka Topics with Terraform. Recently we've changed from Plaintext
communications to AWS IAM Authentication. When doing so, our provider
sometimes would hang indefinitely on some plans. We pinned this to the
`kafka.t3.small` cluster tiers, as these have several limitations,
including a maximum of 4 TCP connections per second.

While debugging the provider, we saw that the call stack was stuck on a
write to the cluster, specifically on the first communication it
attempted with the cluster. Reading through the code, we found a very
interesting comment on the Write function of the crypto/tls package.


https://github.com/golang/go/blob/go1.23.0/src/crypto/tls/conn.go#L1192-L1195
```
// As Write calls [Conn.Handshake], in order to prevent indefinite blocking a deadline
// must be set for both [Conn.Read] and Write before Write is called when the handshake
// has not yet completed. See [Conn.SetDeadline], [Conn.SetReadDeadline], and
// [Conn.SetWriteDeadline].
```
Based on this, TLS requires both write and read deadlines to be set
because the Write function may perform a handshake on the first
communication, and the handshake both writes and reads.

I believe that in our case, since we are working with brokers that don't
have a very reliable network, sometimes the handshake would not progress
on the server side, and we would indefinitely wait for a Read that would
never come.

After implementing this change on our local workstation, instead of
hanging indefinitely, the program finally reported an error:
```
Error: kafka: client has run out of available brokers to talk to: read tcp 10.xxx.xxx.xxx:59582->10.xxx.xxx.xxx:9098: i/o timeout
```

Signed-off-by: Bernardo Valente <bernardofvalente@gmail.com>
3AceShowHand added a commit to pingcap/sarama that referenced this pull request Apr 17, 2026
* Fix data race on Broker.done channel (IBM#2698)

The underlying case was not waiting for the goroutine running the
`responseReceiver()` method to fully complete if SASL authentication
failed. This created a window where a further call to `Broker.Open()`
could overwrite the `Broker.done` channel value while the goroutine
still running `responseReceiver()` was trying to close the same channel.

Fixes: IBM#2382

Signed-off-by: Adrian Preston <PRESTONA@uk.ibm.com>

* fix: add read deadline to tls write (IBM#3283)

Signed-off-by: Bernardo Valente <bernardofvalente@gmail.com>

* fix(client): ignore empty Metadata responses when refreshing (IBM#2672)

We should skip the metadata refresh if a broker contacted during the startup phase returns an empty broker list in its metadata response. The Java client skips empty responses when updating its metadata cache (https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L1149), and Sarama should have feature parity.

Fixes IBM#2664

Signed-off-by: Hao Sun <haos@uber.com>
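
The guard described above reduces to a small pattern; this is an illustrative sketch (`applyMetadata` and the plain string slice are assumptions, not Sarama's actual types): apply a response only when it carries brokers, otherwise keep the cached view intact.

```go
package main

import "fmt"

// applyMetadata updates the cached broker list only when the response is
// non-empty, mirroring the Java client's check. It reports whether the
// cache was updated.
func applyMetadata(cache *[]string, resp []string) bool {
	if len(resp) == 0 {
		return false // skip: an empty response must not wipe the cache
	}
	*cache = resp
	return true
}

func main() {
	cache := []string{"old-broker:9092"}
	fmt.Println(applyMetadata(&cache, nil), cache)                      // skipped, cache kept
	fmt.Println(applyMetadata(&cache, []string{"new-broker:9092"}), cache) // applied
}
```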

---------

Signed-off-by: Adrian Preston <PRESTONA@uk.ibm.com>
Signed-off-by: Bernardo Valente <bernardofvalente@gmail.com>
Signed-off-by: Hao Sun <haos@uber.com>
Co-authored-by: Adrian Preston <prestona@users.noreply.github.com>
Co-authored-by: bvalente <bernardofvalente@gmail.com>
Co-authored-by: HaoSunUber <86338940+HaoSunUber@users.noreply.github.com>